Skip to content

Conversation

@holmeso
Copy link
Contributor

@holmeso holmeso commented Jul 14, 2023

fix(qvisualise/qprofiler): deal with new fastq header and make qvis more robust to deal with these

2 issues here, qprofiler was not aware of this new fastq header format and so each header became a unique instrument (all 28 million of them). Qvisualise would then run out of memory trying to deal
with them all. Have added code to properly deal with the new format, in qprofiler, and
perhaps more importantly, added some limits to the size of collections that can be handled by
qvisualise.

New fastq header:
@SRR14585604.8 A00805:41:HMJJWDRXX:1:1101:16929:1000 length=101
which looks like the NCBI header as described here

What qprofiler will now do for this is to count the number of spaces in the header. If there are 2, it will just send the middle part to be analysed. For any other number of spaces, the header will be treated as it was previously.

qvisualise will now run some checks on the input xml file size along with the supplied Xmx parameter, and log if it thinks a memory issue may be encountered.
It will also ignore any xml element that has more than 100000 items in it.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Additional unit test have been added.
Code has been run against offending fastq file and has produced satisfactory results

Are WDL Updates Required?

Nope

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…ore robust to deal with these

2 issues here, qprofiler was not aware of this new fastq header format and so each header became a
unique instrument (all 28 million of them). Qvisualise would then run out of memory trying to deal
with them  all.   HAvave added code to deprperly deal with the new format,  in qprofiler, and
perhaps more importantly, addeded some limits to the size of collections that can be handled by
qvisualise.
@delocalizer delocalizer self-assigned this Jul 14, 2023
Copy link
Contributor

@delocalizer delocalizer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A couple of minor changes suggested.

String [] firstElementParams = params[0].split(" ");
if (firstElementParams.length != 2) {
throw new UnsupportedOperationException("Incorrect header format encountered in parseFiveElementHeader. Expected '@ERR091788.3104 HSQ955_155:2:1101:13051:2071/2' but recieved: " + Arrays.deepToString(params));
throw new UnsupportedOperationException("Incorrect header format encountered in parseFiveElementHeader. Expected '@ERR091788.3104 HSQ955_155:2:1101:13051:2071/2' but received: " + Arrays.deepToString(params));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and ll 299, 306 the Exception messages are a bit confusing because they refer not to a generic format but to a specific example as 'Expected'; is there a better wording for these?

logger.info("supplied Xmx memory: " + xmxSize);
double ratio = (double)xmxSize / fileSize;
logger.info("memory / file size ratio: " + ratio);
if (ratio < 8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comment about this heuristic?

map.put(isInteger ? (T) Integer.valueOf(tallyItemElement.getAttribute("value"))
: (T) tallyItemElement.getAttribute("value") , new AtomicLong(count));

if (tallyItemsNL.getLength() < 100000) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

magic number -> constant

}
}
} else {
log.warn(cycleElement.getTagName() + " has too many elements: " + tallyItemsNL.getLength() + " - will leave out of report");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and refer to the exceeded constant value here in the log message

@holmeso holmeso merged commit d3a14d8 into master Jul 17, 2023
@holmeso holmeso deleted the qp_fastq_read_id_bug branch July 17, 2023 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants